GWAS Bootcamp

Genome-Wide Association Studies in Autopolyploids

Felipe Ferrão

University of Florida – July, 2022

Objectives:

  • 40 minutes of hands-on
  • Using R for GWAS analyses
  • GWASpoly package

Extra information


Introduction

Genomic-wide association analyses (GWAS) measure hundreds of thousands of genetic variants (typically Single Nucleotide Polymorphisms, or SNPs), in hundreds with the primary goal being to identify which regions of the genome harbor SNPs that affect some phenotype or outcome of interest. The focus of this tutorial is on GWA analysis of common variants that involves testing association of each SNP independently and subsequently characterizing findings through a variety of visual and analytic tools.

In this tutorial, we will describe the genetic architecture of a trait and at the end of this tutorial we need to be able to answer the following questions:

  • How many genes/QTL ?
  • Where are the genes/QTL in the genome?
  • What is the effect on the phenotype (positive or negative)?
  • What is the gene action (dominant? additive?) ?
  • How much of the phenotypic variance is explained by each marker associated to the gene/QTL?
  • Is there any previous biological information?

Short Introduction

What is GWAS analyses?

See the example below:

  • We have a a single chromosome with 10 SNPs
  • We have an hyphotetical major QTL close to the SNP8 and SNP9
  • Let’s assume the linkage disequilibrium between the QTL and the markers in this toy example is created by physical linkage and not for any other biological source
  • How can I quantify the association between a QTL and SNPs?
Toy example illustrating association between markers and a QTL with large effect

Toy example illustrating association between markers and a QTL with large effect

Regression Analyses

  • Depending on the causal connections between two variables, their true relationship may be linear and can be described using a linear regression

  • Examples:
    • Biological sex (x) associated with salary (y)
    • Product attributes (x) associated to number of sales (y)
    • Fertilization (x) associated to yield (y)
    • Genotypic value (x) associated to the gene content (y)

The simplest regression models is linear with a single predictor:

\[y=\beta_0+\beta_1x+e\]

  • The terms \(\beta_0\) and \(\beta_1\), are the intercept and slope of the model, respectively.

  • Intercept is the point at which the line crosses the y axis at x = 0. It is also the mean, specifically the conditional mean, of y for individuals with values of 0 on x.

  • The slope expresses the relationship between y and x. Positive slope values indicate that larger values of x are associated with correspondingly larger values of y, while negative slopes mean that larger x values are associated with smaller y values. Holding everything else constant, larger values of slope (positive or negative) indicate a stronger linear relationship between y and x.

  • Error represents the random error inherent in any statistical model, including regression. It expresses the fact that for any individual, the model will not generally provide a perfect predicted value of y.

Estimation

In all real-world contexts, the population is unavailable to the researcher. Therefore, \(\beta_0\) and \(\beta_1\) must be estimated using sample data taken from the population. The statistical literature describes several methods for obtaining estimated values of the regression model parameters given a set of x and y.

Geometric Representation

  • What is the best slope and intercept for this example?
  • We can test the different slopes and intercepts combinations (gray)
  • Best approach is the line that minimize the difference between observed and predicted results.
Toy example of simple linear regression

Toy example of simple linear regression

Ordinary Least Squares (OLS)

  • By far, OLS is the most popular and widely used approach for inference of slopes and intercept in regression analyses.
  • The goal of OLS is to minimize the sum of the squared differences (RSS) between the observed values of y and the model predicted values of y across the sample.

\[\hat{\beta} = (X'X)^{-1} X'y\] Below the OLS in action:

Mixed Model Framework

  • Simple regression suffer of high rate of false positives, that is, declare significant even though it is not actually true.
  • Most modern GWAS software use mixed models (MLM)
  • Why? Better control of false positives

Naive example on the importance to use cofactors in regression model

  • 90 students in the classroom
  • Relationship between hair size and height of the students
  • Statistical Model: hair_size = mean + height + error

              Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 38.8703831 3.60057792 10.795596 8.503132e-18
hight       -0.1845235 0.02065016 -8.935693 5.568133e-14
  • We took notes about the gender.
  • New model: hair_size = mean + gender + height + error
  • On average, girls (red dots) have longer hair and are shorter than boys (blue dots).
  • Slope is not significant anymore

                      Estimate Std. Error    t value     Pr(>|t|)
(Intercept)        5.577631689 4.45579195  1.2517711 2.140084e-01
as.factor(cor)red  2.931686390 0.31974835  9.1687303 2.026746e-14
hight             -0.002799516 0.02474259 -0.1131456 9.101758e-01

Population Structure

Q + K linear mixed model

  • Under the mixed model framework of MLM,control of false positives is realized by having both a fixed effect of population structure and a random effect of polygenic background that is defined by the kinship
  • Important paper presenting the Q+K model (Yu et al., 2006)
  • The SNP-trait association analyses were based on a linear mixed model, accounting for population structure (Q) and relative kinship (K) matrices as implemented in the GWASpoly R-package

\[y = ZS\tau + ZQv + Zu + e \] Where:

  • \(y\) = is a vector of observed phenotypes
  • \(v\) = is a fixed effect of sub-population effects, can be defined as PCA or via STRUCTURE
  • \(u\) = is a random polygenic effect, with a multivariate normal distribution with a zero mean vector and VCOV matrix proportional to a kinship matrix (K-matrix)
  • \(\tau\) = fixed effect associated to the SNP information

Multiple Test Corrections

  • In part, this is because testing each SNP in the dataset results in thousands or even millions of statistical tests.
  • FDR (False Discovery Rate)
  • Bonferroni (\(\dfrac{\alpha}{N}\)), where N is the number of markers/tests
  • Permutation:
    • Phenotypic values are shuffled
    • Random association between markers and phenotype is estimated
    • Minimal p-value is recorded
    • Repeat many times, until create our “own null hyphotesis” and define our “own 95% quantile”

Software

  • There are multiple GWAS software reported in the literature
  • Below, some software that I have some experience and my personal opinion
  • Interesting test software with different models implemented (ex: single-SNP vs. multiple SNP)
Method Language Link Comments
GCTA UNIX Link Large family of software used for associations analyses.
GEMMA UNIX Link Very fast. Ideal to run in parallel for many traits. Easy to expand for Bayesian analyses (BSLMM model, which I particularly enjoy!!)
rrBLUP R Link Easy to run and it uses the same input for GS analyses. Nice tool as a starting point
FarmCPU R Link Gaining more popularity in literature. Join the advantages of mixed linear model and stepwise regression
GWASpoly R Link Ideal for polyploid analyses. Contains a series of genetic parameterizations and graphical analysis for polyploids
PLINK UNIX Link Tradtional software. Many functionalities for filtering and ploting

Polyploid Parametrizations

  • In polyploid studies we can test different gene actions.
  • Genotypic value of an individual is estimated differently in polyploid and diploid species.
  • In autotetraploids, the higher number of alleles per locus reflects on different coefficients of dominance, increasing the range of genetic models to describe one-locus genotypic value.
  • For more details, see Rosyara et al., (2016)

See below some examples of potential gene actions tested for autotetraploid species (2n=4X):

OLS in action for a simple regresion model

OLS in action for a simple regresion model

Filtering Analyses

For more details, see Reed et al., (2018)

Single nucleotide polymorphism-level filtering

  • Call rate: The call rate for a given SNP is defined as the proportion of individuals in the study for which the corresponding SNP information is not missing. Ex: a call rate of 95%, meaning we retain SNPs for which there is less than 5% missing data.
  • Minor allele frequency (MAF): a large degree of homogeneity at a given SNP across study participants generally results in inadequate power to infer a statistically significant relationship between the SNP and the trait under study. MAF of 1% and 5% are commonly used in GWAS analyses.
  • Genotypic Frequency: maximum genotype frequency.
  • updog parameters: outputs from allele dosage analyses
  • HWE: in some cases, violations of HWE can be an indication of the presence of population substructure or the occurrence of a genotyping error.
  • Linkage disequilibrium (LD) pruning: eliminates a large degree of redundancy in the data and reduces the influence of chromosomal artifacts

Sample-level filtering

  • Call rate: Similar to SNP-level filtering based on call rate, we exclude individuals who are missing genotype data across more than a pre-defined percentage of the typed SNPs.
  • Cryptic relatedness: after created the K matrix, individuals with large relationship value are indications of duplicates. PCA analyses can also be used
  • Heterozygosity: excess heterozygosity across typed SNPs within an individual may be an indication of poor sample quality

Dealing with Missing Data

  • Imputation:
    • Beagle: very popular and used for diploid species
    • Simple metrics: mean values have been used as default for many software. In the case of the GWASpoly, population mode has been used as the default.

Graphical Analyses

Quantile–Quantile (QQ) plot

  • Quantile distribution of observed p-values (on the y-axis) on the quantile distribution of expected p-values (on x-axis)
  • Expected p-values have a random uniform distribution
  • We are looking for few of the p-values are in LD with a causal polymorphism and had significant p-values
Yu et al., 2016 – Simple: only the SNP effect; Q: Model with SNP + Q effects; Q+K: model with SNP + Q + polygenic effect in a mixed model framework

Yu et al., 2016 – Simple: only the SNP effect; Q: Model with SNP + Q effects; Q+K: model with SNP + Q + polygenic effect in a mixed model framework

Manhattan plot

  • Graphical tool to show significant hits associated with the trait under test
  • Each dot represents a SNP, ordered across the chromosomes, that was subjected to a regression (or mixed model) analyses
  • y-axis: log of p-value
  • Vertical line: multiple test correction
Physiological and behavioral trait loci identified in mice, from Parker et al, Nature Genetics, 2016.

Physiological and behavioral trait loci identified in mice, from Parker et al, Nature Genetics, 2016.

Hands-on

  • GWAS for firmness in blueberry
  • Phenotypes: BLUES-2014, BLUE-2015, BLUE both years, BLUP both years
  • Marker score: tetraploid allele dosage from updog
  • Model: Q+K GWAS model implemented in GWASpoly

Data set

  • Match genomic and phenotypic data
  • Check if they are in the same order
library(dplyr)
library(openxlsx)

# Matching genotype and phenotype
# i) Download the phenotypic and pedigree data with the genotypeID
load("/home/lferrao/Dropbox (UFL)/classes/gwas_tutorial/3.gwas/hands_on1/pheno_metrics.Rdata")
ped = read.xlsx("/home/lferrao/Dropbox (UFL)/PosDoc_UF_NewGenome/DataSet/Pedigree/PedigreeV59.xlsx") %>%  select("Clone", "RAPIDcode1")
table(phenotype2$genotype %in% ped$Clone) # all individuals are listed in the pedigree

# ii) Download the genomic data
load("/home/lferrao/Dropbox (UFL)/PosDoc_UF_NewGenome/DataSet/SNPs_CatherinePop_Rdata/2021/Breeding_program_6423samples/updog/updog_6423ind_59k_genomat.Rdata")
idx = intersect(colnames(geno), pheno1$RAPIDcode1)
length(idx) # there are some individuals not genotyped (poor DNA quality)

# 1726

# iii) Filtering the same ind (not in the same order)
pheno2 = pheno1 %>% filter(RAPIDcode1 %in% idx) %>% droplevels
geno2 = geno %>%  select(idx) %>% droplevels()
list(dim(geno2),dim(pheno2)) # same dimention

# 1726 ind

table(as.character(pheno2$RAPIDcode1) == colnames(geno2)) # Not in the same order

# FALSE

# iv) Matching both data set
pheno3= pheno2[match(colnames(geno2),pheno2$RAPIDcode1),] # order the pheno data, depeding on the genotypeID
table(pheno3$RAPIDcode1 == colnames(geno2)) 

# TRUE

# v) Create the gwas input file
# -- phenotype
tmp = pheno3 %>%   select(-genotype,-RAPIDcode1)
pheno.final = data.frame(pheno3$RAPIDcode1,tmp) 

# -- genotype
geno.final=data.frame(geno.map,geno2)
geno.final[1:5,1:10]

# Exporting the data
write.csv(pheno.final, "/home/lferrao/Dropbox (UFL)/classes/gwas_tutorial/3.gwas/hands_on1/phenotype.csv",
          row.names = F)
write.csv(geno.final, "/home/lferrao/Dropbox (UFL)/classes/gwas_tutorial/3.gwas/hands_on1/geno.csv",
          row.names = F)

Final input for phenotype

       pheno3.RAPIDcode1 blue2014 blue2015 blue_both blup2014 blup2015 blup_both  avg2014  avg2015 avg_both
1196 UFF_192502_P01_WA01 258.0301 240.2174  249.1237 250.6691 232.5669  241.6180 231.9533 266.2942 249.1237
1200 UFF_192502_P01_WA02 196.8033 178.9905  187.8969 219.4260 201.3237  210.3748 213.2558 162.5379 187.8969
1201 UFF_192502_P01_WA03 232.1425 214.3297  223.2361 237.4590 219.3568  228.4079 203.2501 243.2221 223.2361
1203 UFF_192502_P01_WA04 226.3855 208.5728  217.4792 234.5213 216.4191  225.4702 225.1066 209.8517 217.4792
1135 UFF_192502_P01_WA05 233.2825 215.4697  224.3761 238.0407 219.9385  228.9896 304.7860 143.9661 224.3761
1136 UFF_192502_P01_WA07 291.2868 273.4740  282.3804 267.6394 249.5372  258.5883 271.2847 293.4761 282.3804

Final input for the genotype

         snp.name chr    pos UFF_192502_P01_WA01 UFF_192502_P01_WA02 UFF_192502_P01_WA03 UFF_192502_P01_WA04
1_74957   1_74957   1  74957                   1                   2                   2                   3
1_74985   1_74985   1  74985                   1                   2                   2                   3
1_130006 1_130006   1 130006                   4                   4                   4                   4
1_130024 1_130024   1 130024                   1                   2                   2                   2
1_137968 1_137968   1 137968                   4                   4                   4                   4

Reading the data

## Number of polymorphic markers: 59901 
## Missing marker data imputed with population mode 
## N = 1726 individuals with phenotypic and genotypic information 
## Detected following traits:
## blue2014
## blue2015
## blue_both
## blup2014
## blup2015
## blup_both
## avg2014
## avg2015
## avg_both

Define parameters

LOCO vs. non-LOCO method:

  • The use of the polygenic term in the form of K matrix is equivalentequivalent to including all of the markers as random effects, since all markers are used to create the polygenic term.
  • This kind of redudance is so-called called “proximal contamination” and can reduce the model performance.
  • The idea of LOCO (leave-one-chromosome out) is to compute K-matrix for each chromosome based on the markers from all other chromosomes.
  • From my personal experience, I have seen clearer manhattan plots, with tower-like peaks, at the cost of a little QQ-plot inflation. In short, I think it’s promising.

PCA methods

  • PCA are used as fixed effect.
  • In many situations, a mode contains only the polygenic term (K matrix) is sufficient to control population structure.
  • A recent paper addressed this issue and confirms that the use of only the polygenic effect, in most of the cases, had a better callibration than including the PCAs (Yao et al., 2022).

GWAS models

Manhattan plots

Connecting the dots

  • Three phenotypic metrics:
    • blues: fitted as asreml(fixed=firm ~ year + tree, data=firmness) and with the marginal means estimated per years
    • blups: fitted as areml(fixed=firm ~ year, random = ~ tree, data=firmness) and with the marginal random effects predicted per year
    • avg: average values across years
## Thresholds
##           general additive diplo-general diplo-additive 1-dom-alt 1-dom-ref
## blue2014     6.08     6.08          6.08           6.08       5.6      6.05
## blue2015     6.08     6.08          6.08           6.08       5.6      6.05
## blue_both    6.08     6.08          6.08           6.08       5.6      6.05
## blup2014     6.08     6.08          6.08           6.08       5.6      6.05
## blup2015     6.08     6.08          6.08           6.08       5.6      6.05
## blup_both    6.08     6.08          6.08           6.08       5.6      6.05
## avg2014      6.07     6.08          6.08           6.08       5.6      6.05
## avg2015      6.07     6.08          6.07           6.08       5.6      6.05
## avg_both     6.08     6.08          6.08           6.08       5.6      6.05
##           2-dom-alt 2-dom-ref
## blue2014       5.79      5.97
## blue2015       5.79      5.97
## blue_both      5.79      5.97
## blup2014       5.79      5.97
## blup2015       5.79      5.97
## blup_both      5.79      5.97
## avg2014        5.79      5.97
## avg2015        5.79      5.97
## avg_both       5.79      5.97

Questions:

  • We will focus only on the additive effect
  • Do you expect differences across the blues values? Why?
  • Do you expect differences between blues and blups? Why ?

Important:

  • When fitting phenotypic models we need to know what we are correcting for. In the case of the blues and blups values, when an explicity fixed effect is created for the year effect we are correncting for variations associated to the year. Simply stating, we are removing a “constant value” that is part of the year variation from both years. It means that marginal BLUEs or BLUPs extracted from the model will have the same ranking, but a “constant value” representing the year effect is removed per year.
  • blue and blup values are corrected per year. avg values are capturing eventual GxE effects
  • As we discussed in the tutorial 1, we do not expect large differences between blues and blups when blups are computed using an Identity matrix

Other gene actions

  • What is the importance of testing different gene actions?
  • Why we have different results in 2014, 2015 and combining both years?
  • What is the most important result for breeders?

Genetic Architecture

By understanding the genetic architecture, we need to answer:

  • How many genes/QTL ? getQTL
  • Where are the genes/QTL in the genome? getQTL
  • What is the effect on the phenotype (positive or negative)? getQTL
  • What is the gene action (dominant? additive?) ? getQTL
  • How much of the phenotypic variance is explained by each marker associated to the gene/QTL? fit.QTL
  • What is the polymorphism? vcf file
  • Is there any previous biological information? next tutorial
Trait Model Threshold Marker Chrom Position Ref Alt Score Effect
16293 avg_both additive 6.08 8_13021047 8 13021047 0 1 6.31 -5.225137
28395 avg_both diplo-general 6.08 2_36707557 2 36707557 0 1 6.38 NA
19871 avg_both diplo-general 6.08 9_5347529 9 5347529 0 1 6.16 NA
198711 avg_both diplo-additive 6.08 9_5347529 9 5347529 0 1 6.90 10.987334
198712 avg_both 1-dom-ref 6.05 9_5347529 9 5347529 0 1 6.96 11.278177
22830 avg_both 2-dom-ref 5.97 9_24397156 9 24397156 0 1 6.26 -28.600094

Proportion of the phenotypic variance

  • R2 statistic: coefficient of determination, is one of the most frequently used measures of prediction power and goodness-of-fit for simple linear regression models. R2 -like statistics complement statistical testing by providing practitioners with a more intuitive measurement than the P-value from other statistical tests.

  • GEMMA author suggested to take a look on the ratio between vg and ve for the alternative model (l_remle=vg/ve) reported in the l_remle collumn.

  • Formula 1 presented here : \((\beta^2 \times var(x))/var(y)\), where var(x) is the variance of the genotype at the focal SNPs, and var (y) is the variance of the phenotype. \(\beta\) is reported for each SNP in the LMM output.

  • Formula 2 presented here: pve <- var(x) * (beta^2 + se^2)/var(y)

  • Formula 3 presented here as: \(V_g = 100 \times \dfrac{2pq\beta^2}{\sigma^2_a}\), where, p and q are the allele frequencies for a given trait, \(\beta\) is the additive effects of the SNPs, and \(\sigma^2_a\) is the additive genetic variance for a trait

GWASpoly approach

  • GWASpoly use a multiple QTL model approach and estimate the partial R2 for each QTL based on backward elimination.
  • Backward Regression:
    • Specify a full model with all QTL
    • Each step gradually eliminates variables (ie., QTL) from the regression model using a given method (ex: AIC, p-value or other)
    • Find a reduced model that best explains the data
          Marker Chrom Position    Model         R2        pval
16293 8_13021047     8 13021047 additive 0.01439076 7.55507e-07
19871 9_5347529     9  5347529 1-dom-ref 0.01649576 1.16579e-07
  • The QTL on chr08 explains 1.4% of the variation, has an additive gene action and a negative effect on the pehnotypic variation of firmness
  • The QTL on chr09 explains 1.6% of the variation, has an dominance gene action and a positive effect on the pehnotypic variation of firmness
  • pval presented in this table is not the same log(p-value) presented in the manhattan plot. The pval reported with the R2 refers to the backward regression and the significance on including a given marker in regression linear model.

Ploting the results

Conclusions

  • With this tutorial we finished the statistical part related to finding associations.
  • The next part is more related to gene mining and the biological meaning of the results. Very important!! As I mentioned in the introduction, in our lab we have a statistical geneticist and a computer biologist working side by side to draw conclusions.
  • Be careful when using the codes presented here. Be curious and try to adapt the ideas for your project. Make questions. Create hypotheses. Test other methods and software.

References

  • Ferrão et al., (2018). Insights Into the Genetic Basis of Blueberry Fruit-Related Traits Using Diploid and Polyploid Models in a GWAS Context. Link
  • Ferrão et al., (2020). Genome-wide association of volatiles reveals candidate loci for blueberry flavor. Link
  • Rosyara et al., (2016). Software for Genome-Wide Association Studies in Autopolyploids and Its Application to Potato. Link
  • Tutorial GWASpoly. Link